In this chapter, you will learn how to assess models after training them. You will analyze classification model performance in scikit-learn using several metrics and a visualization technique, and you will use hyperparameter tuning to optimize classification and regression models.
This is my learning experience of data science through DataCamp.
Optimizing your model
After training a model, we must evaluate its performance. In this section, we will explore some of the other metrics available in scikit-learn for assessing a model’s performance, and we will use hyperparameter tuning to optimize classification and regression models.
Classification metrics
In Chapter 1, you evaluated the accuracy of your k-NN classifier. As Andy discussed, accuracy is not always an informative metric. Here, you will evaluate the performance of binary classifiers by computing a confusion matrix and generating a classification report.
As shown in the video, the classification report consisted of three rows and an additional support column. In the video example, the support was the number of Republicans or Democrats in the test set on which the classification report was computed. Each row gives the precision, recall, and f1-score for that particular class.
This tutorial uses the PIMA Indians dataset available at the UCI Machine Learning Repository. Based on factors such as BMI, age, and number of pregnancies, we predict whether or not a given female patient will develop diabetes, so it is a binary classification problem. A target value of 0 indicates that the patient does not have diabetes, while a target value of 1 indicates that she does. The dataset has been preprocessed to deal with missing values in earlier exercises.
Code
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
Code
df = pd.read_csv('diabetes.csv')
df.insulin.replace(0, np.nan, inplace=True)
df.triceps.replace(0, np.nan, inplace=True)
df.bmi.replace(0, np.nan, inplace=True)
df.iloc[:, 1:] = df.iloc[:, 1:].apply(lambda x: x.fillna(x.mean()))
y = df['diabetes']
X = df.drop('diabetes', axis=1)

# Create training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)

# Instantiate a k-NN classifier: knn
knn = KNeighborsClassifier(n_neighbors=6)

# Fit the classifier to the training data
knn.fit(X_train, y_train)

# Predict the labels of the test data: y_pred
y_pred = knn.predict(X_test)

# Generate the confusion matrix and classification report
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
By analyzing the confusion matrix and classification report, you can get a much better understanding of your classifier’s performance.
An ROC curve (receiver operating characteristic curve) is a graph showing the performance of a classification model at all classification thresholds. It plots two parameters: the true positive rate and the false positive rate.
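For reference, in terms of the counts in a confusion matrix these two rates are defined as:

True Positive Rate (TPR) = TP / (TP + FN)
False Positive Rate (FPR) = FP / (FP + TN)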
Building a logistic regression model
Time to build your first logistic regression model! As Hugo showed in the video, scikit-learn makes it very easy to try different models, since the Train-Test-Split/Instantiate/Fit/Predict paradigm applies to all classifiers and regressors - which are known in scikit-learn as ‘estimators’. You’ll see this now for yourself as you train a logistic regression model on exactly the same data as in the previous exercise. Will it outperform k-NN? There’s only one way to find out!
Code
# Import the necessary modules
from sklearn.linear_model import LogisticRegression

# Create training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)

# Create the classifier: logreg
logreg = LogisticRegression(solver="liblinear")

# Fit the classifier to the training data
logreg.fit(X_train, y_train)

# Predict the labels of the test set: y_pred
y_pred = logreg.predict(X_test)

# Compute and print the confusion matrix and classification report
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
The previous exercise was a great success - you now have a new classifier in your toolbox!
Model performance can be evaluated quantitatively using classification reports and confusion matrices, and visually using ROC curves. Most scikit-learn classifiers have a .predict_proba() method that returns the probability of a given sample being in a particular class, as Hugo demonstrated in the video. After building a logistic regression model, you will plot an ROC curve to evaluate its performance and, in the process, become familiar with the .predict_proba() method.
You’ll continue working with the PIMA Indians diabetes dataset here.
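A minimal sketch of how such an ROC curve could be plotted for the logistic regression model from the previous exercise, assuming logreg, X_test, and y_test are still in scope, using roc_curve from sklearn.metrics:

Code

from sklearn.metrics import roc_curve

# Compute predicted probabilities of the positive class (diabetes = 1)
y_pred_prob = logreg.predict_proba(X_test)[:, 1]

# Generate ROC curve values: false positive rate, true positive rate, thresholds
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)

# Plot the ROC curve along with the diagonal of a random classifier
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr, tpr, label='Logistic Regression')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Logistic Regression ROC Curve')
plt.legend()
plt.show()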
You may have noticed that the y-axis of the ROC curve (the true positive rate) is also known as recall. ROC curves are not the only way to evaluate model performance visually. Precision-recall curves are generated by plotting precision and recall at different thresholds. Recall that precision and recall are defined as follows:

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
In the exercise, the precision-recall curve for the diabetes dataset has been generated, and the classification report and confusion matrix are displayed in the IPython Shell.
Take a look at the precision-recall curve, then consider the statements given in the exercise and pick the one that is not true. Here, the positive class (1) means the individual has diabetes.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

#df = pd.read_csv('diabetes.csv')
df.insulin.replace(0, np.nan, inplace=True)
df.triceps.replace(0, np.nan, inplace=True)
df.bmi.replace(0, np.nan, inplace=True)
df.iloc[:, 1:] = df.iloc[:, 1:].apply(lambda x: x.fillna(x.mean()))
y = df['diabetes']
X = df.drop('diabetes', axis=1)

# Import necessary modules
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

# Create training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)

# Instantiate a k-NN classifier: knn
knn = KNeighborsClassifier(n_neighbors=6)

# Fit the classifier to the training data
knn.fit(X_train, y_train)

# Predict the labels of the test data: y_pred
y_pred = knn.predict(X_test)

# Generate the confusion matrix and classification report
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
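The snippet above does not itself draw the precision-recall curve; a minimal sketch of how it could be generated for the fitted k-NN classifier, using precision_recall_curve from sklearn.metrics:

Code

from sklearn.metrics import precision_recall_curve

# Predicted probabilities of the positive class (diabetes = 1)
y_pred_prob = knn.predict_proba(X_test)[:, 1]

# Compute precision and recall at different probability thresholds
precision, recall, thresholds = precision_recall_curve(y_test, y_pred_prob)

# Plot recall on the x-axis and precision on the y-axis
plt.plot(recall, precision, label='k-NN (k = 6)')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve')
plt.legend()
plt.show()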
Suppose we have a binary classifier that makes guesses at random. It would be correct approximately 50% of the time, and its ROC curve would be a diagonal line on which the true positive rate and false positive rate are always equal. The area under this ROC curve is 0.5. Hugo discussed the AUC in the video as an informative metric for evaluating models: if the AUC is greater than 0.5, the model is better than random guessing. Always a good sign!
We will calculate AUC scores on the diabetes dataset using the roc_auc_score() function from sklearn.metrics.
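A short sketch of what this could look like, assuming the logistic regression model and train/test split from the earlier exercises are still in scope; the 5-fold cross-validated AUC via cross_val_score with scoring='roc_auc' is shown as well:

Code

from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score

# Predicted probabilities of the positive class on the test set
y_pred_prob = logreg.predict_proba(X_test)[:, 1]

# AUC on the hold-out test set
print("AUC: {}".format(roc_auc_score(y_test, y_pred_prob)))

# AUC scores computed using 5-fold cross-validation
cv_auc = cross_val_score(logreg, X, y, cv=5, scoring='roc_auc')
print("AUC scores computed using 5-fold cross-validation: {}".format(cv_auc))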
We saw how GridSearchCV can be used on the voting dataset to tune the n_neighbors parameter of KNeighborsClassifier(). Now we will practice this more thoroughly, using logistic regression on the diabetes dataset instead!
We saw earlier that logistic regression also has a regularization parameter: C. C is the inverse of the regularization strength: a large C can result in an overfit model, while a small C can result in an underfit model.
You have been provided with the hyperparameter space for C. We will use GridSearchCV and logistic regression to find the optimal C. The feature array X and target variable array y have been preloaded.
Note that the data has not been split into training and test sets here; the focus is on setting up the hyperparameter grid and performing grid-search cross-validation. In practice, you will indeed want to hold out a portion of the data for evaluation, and you will learn all about this in the next video!
Code
# Import necessary modules
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Setup the hyperparameter grid
c_space = np.logspace(-5, 8, 15)
param_grid = {'C': c_space}

# Instantiate a logistic regression classifier: logreg
logreg = LogisticRegression(solver="liblinear")

# Instantiate the GridSearchCV object: logreg_cv
logreg_cv = GridSearchCV(logreg, param_grid, cv=5)

# Fit it to the data
logreg_cv.fit(X, y)

# Print the tuned parameters and score
print("Tuned Logistic Regression Parameters: {}".format(logreg_cv.best_params_))
print("Best score is {}".format(logreg_cv.best_score_))
Tuned Logistic Regression Parameters: {'C': 3.727593720314938}
Best score is 0.7708768355827178
Hyperparameter tuning with RandomizedSearchCV
GridSearchCV is computationally expensive, especially when dealing with multiple hyperparameters and large hyperparameter spaces. To solve this problem, RandomizedSearchCV can be used, in which not all hyperparameter values are tested. A fixed number of hyperparameter settings is instead sampled from specified probability distributions. In this exercise, you will practice using RandomizedSearchCV.
You will also be introduced to a new model: the decision tree. You don’t need to worry about the specifics of how this model works. In scikit-learn, decision trees have .fit() and .predict() methods, just like k-NN, linear regression, and logistic regression. Decision trees are ideal for practicing RandomizedSearchCV because they have many parameters that can be tuned, such as max_features, max_depth, and min_samples_leaf.
The diabetes dataset has been preloaded with the feature array X and target variable array y. You have been given the hyperparameter settings. To determine the optimal hyperparameters, you will use RandomizedSearchCV.
Code
# Import necessary modules
from scipy.stats import randint
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import RandomizedSearchCV

# Setup the parameters and distributions to sample from: param_dist
param_dist = {"max_depth": [3, None],
              "max_features": randint(1, 9),
              "min_samples_leaf": randint(1, 9),
              "criterion": ["gini", "entropy"]}

# Instantiate a Decision Tree classifier: tree
tree = DecisionTreeClassifier()

# Instantiate the RandomizedSearchCV object: tree_cv
tree_cv = RandomizedSearchCV(tree, param_dist, cv=5)

# Fit it to the data
tree_cv.fit(X, y)

# Print the tuned parameters and score
print("Tuned Decision Tree Parameters: {}".format(tree_cv.best_params_))
print("Best score is {}".format(tree_cv.best_score_))
Tuned Decision Tree Parameters: {'criterion': 'entropy', 'max_depth': 3, 'max_features': 5, 'min_samples_leaf': 5}
Best score is 0.7422629657923776
Hold-out set in practice I: Classification
We will now practice evaluating a model with tuned hyperparameters on a hold-out set. The feature array X and target variable array y have been preloaded from the diabetes dataset.
In addition to C, logistic regression has a 'penalty' hyperparameter that specifies whether 'l1' or 'l2' regularization should be used. In this exercise, you will create a hold-out set and tune the 'C' and 'penalty' hyperparameters of a logistic regression classifier using GridSearchCV on the training set.
Code
# Import necessary modules
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Create the hyperparameter grid
c_space = np.logspace(-5, 8, 15)
param_grid = {'C': c_space, 'penalty': ['l1', 'l2']}

# Instantiate the logistic regression classifier: logreg
logreg = LogisticRegression(solver='liblinear')

# Create train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)

# Instantiate the GridSearchCV object: logreg_cv
logreg_cv = GridSearchCV(logreg, param_grid, cv=5)

# Fit it to the training data
logreg_cv.fit(X_train, y_train)

# Print the optimal parameters and best score
print("Tuned Logistic Regression Parameter: {}".format(logreg_cv.best_params_))
print("Tuned Logistic Regression Accuracy: {}".format(logreg_cv.best_score_))
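Note that best_score_ printed above is the best cross-validation accuracy on the training data. To actually make use of the hold-out set, the tuned classifier could also be scored on the test data, for example:

Code

# Evaluate the tuned classifier on the hold-out set
print("Hold-out set accuracy: {}".format(logreg_cv.score(X_test, y_test)))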
Remember lasso and ridge regression from the previous chapter? Lasso used the L1 penalty to regularize, while ridge used the L2 penalty. There is another type of regularized regression known as the elastic net. In elastic net regularization, the penalty term is a linear combination of the L1 and L2 penalties:

a * L1 + b * L2

In scikit-learn, this term is represented by the 'l1_ratio' parameter: an 'l1_ratio' of 1 corresponds to an L1 penalty, and anything lower is a combination of L1 and L2.
In this exercise, you will use GridSearchCV to tune the 'l1_ratio' of an elastic net model trained on the Gapminder data. As in the previous exercise, use a hold-out set to evaluate your model’s performance.
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
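The exercise solution itself is not reproduced here; a minimal sketch of how the tuning could look, assuming the Gapminder feature array X and target array y are preloaded (the grid of 30 l1_ratio values between 0 and 1 is an illustrative choice):

Code

# Create train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)

# Create the hyperparameter grid for l1_ratio
l1_space = np.linspace(0, 1, 30)
param_grid = {'l1_ratio': l1_space}

# Instantiate the ElasticNet regressor: elastic_net
elastic_net = ElasticNet()

# Setup the GridSearchCV object: gm_cv
gm_cv = GridSearchCV(elastic_net, param_grid, cv=5)

# Fit it to the training data
gm_cv.fit(X_train, y_train)

# Predict on the hold-out set and compute R squared and MSE
y_pred = gm_cv.predict(X_test)
r2 = gm_cv.score(X_test, y_test)
mse = mean_squared_error(y_test, y_pred)
print("Tuned ElasticNet l1 ratio: {}".format(gm_cv.best_params_))
print("Tuned ElasticNet R squared: {}".format(r2))
print("Tuned ElasticNet MSE: {}".format(mse))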
Now that we have a basic understanding of how to fine-tune models, it’s time to learn about preprocessing techniques and how to piece together all the different stages of the machine learning process into a pipeline!